Digital Factory Interview Challenge


—— Predicting Personal Loan Conversion for Thera Bank
Silvia Shixian Yang
Due: 2021-9-19

Table of Contents


Executive Summary

The target of this study is to explore ways to help Thera bank increase their conversion rate in the persaonal loan purchase. Given some customer information, we need to find the factors that are strongly influencing customers' decision in personal loan purchase. We will first perform initial inspection to avoid abnormal data and get a sense of what we are offering. Then, we will analyze the data in univariate and bivariate perspectives, and finally use Logistic Regression to interpret the data.

1. Import and Initial Data Inspection

In this section, data and libraries are imported, and an initial inspection is performed to check the integrity of the dataset. The original .csv file is read as a pandas dataframe, which has 5000 observations and 14 attributes. Firstly, the datatypes of the attributes are checked and adjusted accordingly based on their variable types (categorical or numeric) and variable uniqueness. Then, abnormal values are handled based on the inspection of the statistical summary.

The following table summarizes the information of the attributes after the analysis of this section, which can also be found in the end of this section.

Description Variable Type Data Type # of unique values # of null values
Attribute Name
ID Customer ID numeric int64 5000 0
Age Customer's age in completed years numeric int64 45 0
Experience #years of professional experience numeric int64 47 0
Income Annual income of the customer ($000) numeric int64 162 0
ZIP Code Home Address ZIP categorical object 467 0
Family Family size of the customer numeric int64 4 0
CCAvg Avg. spending on credit cards per month (\\$000) numeric float64 108 0
Education Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional categorical object 3 0
Mortgage Value of house mortgage if any. (\\$000) numeric int64 347 0
Personal Loan Did this customer accept the personal loan offered in the last campaign? categorical object 2 0
Securities Account Does the customer have a securities account with the bank? categorical object 2 0
CD Account Does the customer have a certificate of deposit (CD) account with the bank? categorical object 2 0
Online Does the customer use internet banking facilities? categorical object 2 0
CreditCard Does the customer use a credit card issued by the bank? categorical object 2 0

2. Exploratory Data Analysis (EDA)

In this section the data distribution of the attributes is studied. Interesting findings regarding the data are:

Then data for Logistic Regression is prepared by dropping unnecessary and highly-correlated attributes, and categorical variables with more than 2 values are transformed to dummy variables for the model.

So, our model has the target variable as Personal Loan and 11 explanatory variables:

3. Prediction with Logistic Regression

In this section, model is trained and predicted on the test group. The result is performing well. The model correctly predict 95% of the test group with an F1-score as 75%. The reason for the low F1-socre is due to the low Recall as 68% for the people who purchased the loan. In another word, our model only predicted 68% of the loan buyer right. Whatsoever, this value is still strong.

The coefficients of the explanatory variables indicate roughtly the same thing as from our bivariate analysis. Lower age, higher income, higher family size, and higher CCAvg customers tend to have higher probability of purchasing the personal loan. Mortgage has very little influence in the prediction. In comparison, categorical values (except City) have more influence in the result that except CD Account, all others have negative coefficients. For customers who have CD account have a high probabiliy in purchasing the personal loan. It is the opposite for all other categorical variables.

4. Conclusion

Possible suggestion for the bank is that for the future campaign, Thera bank could focus more on customers who have high annual income and monthly credit card spending, large family size, and are also CD Account holders. These people tend to have high probability in purchasing the loan based on the given data. The bank can promote some by-side benefit target on this group of people in order to increase the conversion rate for these people.

1. Import and Initial Data Inspection

1.1. Data and Library Import

Libraries used in this notebook are as following. If the module not found, please uncomment and run the corresponding pip install command in the next cell.

1.2. Shape

The shape of the dataset is 5000 by 14 - 5000 observations and 14 attributes.

1.3. Datatypes

The attribute information given in the statement informs the underlying variable types of the attributes. We have 14 attributes in total, consists of 7 numeric and 7 categorical attributes. Below is a table of the summarized attribute information.

According to the initial check, pandas reads most attributes as 64-bit integers, except CCAvg as 64-bit float. However, we have several categorical attributes who are represented by numbers. So, the data types of the categorical attributes need to be changed to "object" type for further analysis.

Now, we have the data types matching the variable types.

1.4. Unique Value Check

Upon checking the number of unique values in each column, we care more about categorical data and the attribute ID, since they are the ones we have expected number of unique values in mind. Whereas, numeric data is either continuous or discrete with a lot of values, so there is no need to analyze their unique values.

ID has 5000 unique values, indicates that each customer has a unique ID, which is desirable. For the categorical data, we expect to see matched number of unique values with the description. Education should have 3 unique values, since it is a 3-choice question. Personal Loan, Securities Account, CD Account, Online, and CreditCard should all have 2 unique values, since they are true-or-false questions. These are all matched.

1.5. Null Value Check

For the null values, the dataset is free from null values for all attributes as a quick check below. So, we do not need to do any null value processing.

1.6. Statistical Summary

The descriptive statistics is checked differently for numeric and categorical attributes.

1.6.1. Numeric Data

For each numeric attribute, count, mean, std, min, max and quantiles are summarized as below. Except Experience, all other attributes seems to have reasonable statistics, which are positive and have identical counts as the number of total observations.

Experience Attribute Handling

The number of years of professional experience has a minimum as low as -3, which is abnormal and seems to be an incorrect imputation. Taking a closer look, there are 52 negative values in this attribute with 33 "-1"s, 15 "-2"s and 4 "-3"s.

A person could not have negative years of work experience under common sense. Another possible explanation for such a setting might be that it aims to impute missing data to represent something else, like years to graduate if assuming that one will start to work immidiately upon graduation. Then, 1 year from expected graduation date might end up with -1 years of work experience. Whatsoever, since we do not have enough information under this case, we will treat the negative cases as 0, meaning these people have no work experiences.

Since we do not have enough information under this case,

1.6.2. Categorical Data

The descriptive statistics for the categorical data looks fine. All of them have the same count as the observations of the dataset. The "unique" row has been analyzed in the Section 1.4, please refer to the analysis there. "top" row represents the category with the highest count, whose count is the "freq" row. These two statistics look fine and reasonable for all of the categorical data.

Summary of Data Inspection

2. Exploratory Data Analysis (EDA)

Our goal is to explore a way of converting more liability customers to personal loan customers, i.e. to predict whether a customer will buy the personal loan. Thus, the target variable of our interest is "Personal Loan." In this section, the data distribution of the attributes will be studied, and interesting attributes will be closely examined. Bivariate analysis will be performed to find potential predictors and prepare for our prediction model.

Under the statement, we will use Logistic Regression as our prediction model. Therefore, we will drop some attributes to satisfy the assumptions of Logistic Regression and help the model to give better prediction. The assumptions of Logistic Regression are:

  1. Binary target variable,
  2. Independent observations,
  3. No multicolinearity among explanatory variables,
  4. No extreme outliers,
  5. Linear relationship between explanatory variable and taget variable,
  6. Sample size is large.

The ones we need to pay special attentions are number 3, 4 and 5, in which 3 will be checked in bivariate analysis, and 4 will be check in univariate analysis.

2.1. Univariate Analysis - Distribution of Numeric Data

We can take a rough glance of all of the histograms of 6 numeric attributes as the following.

ID is just a sequence of natural numbers from 1 to 5000, so it has a uniform distribution as expected, and nothing more. Whereas, other 5 attributes have more interesting distribution and worth a close-look examination. Family attribute has only 4 values. It can be observed that most cutomers only have 1 family member, but there is no big differences among the number of people in each family size. Each varies by roughly 200 people.

All other variables are showing pretty interesting distribution. So, we will take a close look at the distribution of Age, Experience, Income, CCAvg and Mortgage in the following subsections.

2.1.1. Age

The histogram on the top shows the count of customers in each age from minimum 23 to maximum 67. Values are continuous with no gap.

It can be observed that the age has a roughly symmetric distribution with most of the customers clustered between 30 and 60 years old and short tail on both sides. Such symmetry can also be observed from the boxplot that there is no obvious skewness in the distribution.

This indicates that the customer group of Thera Bank is mainly the middle-age people who are from 30 to 60 with a mean of 45 years old.

2.1.2. Experience

The histogram shows the count of customers in each number of years of professional experiences from minimum 0 to maximum 43. Values are continuous with no gap.

The distribution is roughly even for numbers less than around 35, and then shows a gradual decline as number gets bigger. This is also reflected in the boxplot that the distribution is slightly skewed-right.

This makes sense, because as working years pass 40, people starts to retire. So, there are fewer and fewer customers who have more than 35 years of work experience.

2.1.3. Income

The annual income of the cutomers shows a right-skewed distribution that most of the customers have an annual income from \$39,000 to \\$100,000.

2.1.4 CCAvg

The average spending on credit cards per month shows a heavily right-skewed distribution. Majority of the customers have average spending below \$2.000, and 2.12\% of the customers have zero credit card spending per month. Due to heavy cluster of customers on the left side, the upper fence is identified as \\$5,200 which results in a lot of outliers.

2.1.5. Mortgage

There are 3462 customers, which consists of 69.24% of the 5000 customers, have no house mortgage. Thus, the histogram and boxplot does not make sense due to such a huge number of zeros.

If we only select the customers have positive value in house mortgage, we can observe that the distribution is heavily right-skewed. Most customers have relatively low house mortgage, and as the value increases, the number of people rapidly drops.

The distribution containing zero mortgage is rich with outliers, that is not very advantageous for our model. We will further check the predictability of this attribute towards the purchase action of personal loan in the Section 2.3 to see if we need to drop this attribute for our model.

2.2. Univariate Analysis - Distribution of Categorical Data

We can examine the 8 categorical attributes through the bar plots of their value counts.

Zip Code

Region might have a power in representing people's wealth level, and might end up influencing people's decision of purchasing the personal loan. Since there are 467 zip codes, we will wind it down and make it easy for us to performa analysis.

There are 34 zip codes not found in the database, which are as following. Since the number is small, we will name them as "Not Found", and assume that these people live in a city named "Not Found".

There is only one state for the database, which is California. There are 2, because we set the zip code that are not found as "Not Found" string. Thus, we will use the City attribute, and numbering the cities from 1 to 243 in an order of counts.

Los Angeles has the most number of customers from this sample, which is 7.5% of the sample. San Diego, San Francisco and Berkely have relatively the same number of customers in this sample, each is roughly 5% of the sample.

2.3. Bivariate Analysis - Numeric Data

In this section, we will analyze numeric data through pair-wise scatter plot with Personal Loan labeled in different color, correlation matrix and pair-wise t-test. Through this analysis, we will identify the potential predictors and prepare for our model's explanatory variables.

2.3.1. Pair-wise Scatter Plot

A pairplot is drawn as below for numeric data except ID. Kernal density estimate is plotted along the diagonal. Customers who have purchased the personal loan is marked as orange, otherwise as blue.

It is interesting to observe that Age & Experience pair is showing a clear positive linear relationship, and both Income & CCAvg and Income & Mortgage forms a triangular shape. These are consistent with our common sense that age is proportional to years of professional experience; people who have high income tends to have more credit card spending and more house mortgage due to their affordability, whereas it is impossible for one who has low income and spend a lot of money using credit card and even buy a house with house mortgage.

As of the distribution of personal loan purchase, Income, CCAvg and Family seem to be related to the decision of personal loan purchase in the last campaign. People with high income and credit card spending tend to purchases the personal loan in the last campaign. Higher family size group seems to have more customers who purchased the personal loan.

2.3.2. Correlation Matrix

From the correlation matrix above, Experience&Age pair does have very strong correlation (0.99). Recall that Experience has negative values in the orginal dataset, so between these two attributes, we will drop Experience attribute for our model. CCAvg&Income pair also has a relatively high correlation (0.65), but their scatter plot does not show a strong colinearity, so we will keep them.

2.3.3. Pair-wise Student t-test

We perform the paired sample t-test on the numeric data and check if their means in the two groups are significantly differing from each other. The null hypothesis $H_0$ for this test is $\mu (X_1) = \mu (X_2)$, where $X_1$ and $X_2$ are the two variables representing people who purchased and not purchased the personal loan in each attribute. If the p-value is smaller than 0.05, then we reject the null hypothesis, which means that the mean of $X_1$ and $X_2$ are significantly different.

By performing the paired sample t-test on the numeric data, Income, Family, CCAvg and Mortgage give every small p-values, which means that under the assumption that these attributes follow student-t distribution, their numbers for people who purchased and not purchased personal loan are significantly different. This indicates that they might be pretty good predictor for Personal Loan.

Also note that as studied in Section 2.1.5, Mortgage has 3462 (69.24%) zero values. Although this is a sign of outlier, we cannot drop this attribute. As calculated above, among people who did not purchase personal loan, there are 3150 of them who have zero mortgage, which is 63% of the whole sample. Therefore, this attribute is still a good predictor for Personal Loan, and cannot be dropped.

2.4. Bivariate Analysis - Categorical Data

We are now interested in knowing if the categorical variable has a predictive value in Personal Loan. So, we will examine the number of people who purchased the loan or not for each categorical variable, and at the same time perform the chi-square hypothesis test.

Chi-square hypothesis test aims to test whether the count of the target variable in different categories is the same. Thus, if the test yields a p-value lower than 0.05, we could say that the target variable counts of different categories are significantly different.

Since there are over 100 cities, we will not plot the countplot for it, instead just perform the chi-square test.

From the results above, Education and CD Account show a positive result from the test that the count in Personal Loan differs significantly accross groups.

2.5 Model Preperation

According to the analysis in the above section, the following attributes will be dropped for our model.

We will use City attribute, instead of ZIP Code, which would be broader in a sense for region, and better for data interpretation.

We will create dummy variables for the categorical data that have more than 2 categories, that would be City and Education.

3. Prediction with Logistic Regression

According to the analysis in the above section, the following attributes will be dropped for our model.

3.1. Metrics and Confusion Matrix

The model has a pretty high accuracy score as 95%, and F1 score as 75%. ROC AUC score is also pretty high, which is 83%.

3.2. Coefficients of Explanatory Variables

As expected, in numeric data, Family, Income and *CCAvg have relatively large coefficient, especially for Family, which means that higher they are, it is more likely for the customer to purchase personal loan. The coefficient for Age is a small negative number, which means that elder customers tend to purchase the loan, but the influence of the age is relatively small. It is interesting to see that Mortgage has very small coefficient, which is due to the large number of zeros in the data that making its influence to the personal loan purchase weak.

For the categorical data, CD Account and Education_1 have very large absolute coefficient, which are higher than 3. This means that people who have CD Account will be more likely to purchase the personal loan. People who are undergraduate are highly likely not to purchase the personal loan. These are also consistent with our findings in bivariate analysis for categorical data in Section 2.4. Other coefficients are all pretty high comparing to those of numeric variables, in which Security Account, Education_2 and Education_3 all have coefficients less than -0.8, which means that people who do not have security account, not Grad or Advanced tend not to purchase loan. This is also resonable, since from Section 2.4, they all have large proportion of the group who did not purchase the loan.

Taking a glance at the state of the coefficient of City dummy variables, the mean is roughly -0.02, with min as low as -0.97 and max as large as 1.08. So, let us take a look at those cities which have absolute coefficients larger than 0.8.

Overall, the most influencial coeffients are actually Education and CD Account who have absolute coefficients larger than 3.

It is interesting to see that cutomers from these cities are pretty influential in person loan purchase, especially Los Gatos, Cardiff By The Sea, Caslsbad, Monterey, and South Sanfrancisco. However, if we check the customer count for the number of customers in these region, there are pretty few customers (all less than 100). So, we could not draw any conclusion on the influence of regions.

3.3. Wrong Predictions by the Model

There are 69 wrong classifications by the model, in which there are 19 false negative, i.e. not purchased but predicted as purchased, and 50 false positive, i.e. purchased but predicted as not purchased.

It is interesting to see that for the categorical stats, the categority with the top count for both wrong prediction cases is the same. Education is 1, Securities Account is 0, CD Account is 0, Online is 1, and CredictCard is 0, and they all have pretty high frequency. The only different is the top count City, but the frequencies are both pretty low. Thus, we can say that the wrong predictions are mainly caused by the numeric variables.

Recall from the coefficients of our model that lower age, higher income, higher family size, and higher CCAvg tend to have higher probability of purchasing the personal loan. Mortgage has little influence in the prediction.

It can be observed that for false negative cases, comparing to their univariate distribution, average Age is 38.37, which is relatively small, average Income as 170.11 which is very large (out of the upper fence), Family has a mean of 1.95 but with high standard deviation, and CCAvg is very high as 5.79 (out of the upper fence). The trend in Age, Income, and CCAvg of these false negative cases leads them to be predicted as purchases, but actually they did not.

It can be observed that for false negative cases, comparing to their univariate distribution, average Age is 45.48, which is around the middle and larger than that of the false negative case, average Income as 124.12 which is also pretty large (out of the upper fence), but comparing to the false negative cases, it decreases, Family has a mean of 2.7 but with high standard deviation, and CCAvg is a little high as 3.77 but within 75% quantile, and Mortgage is very high as 106.14 which is above the average but has high standard deviation as well. These are all signs of a low probability of purchasing the loan. So, our model predicts them as not purchased, but actually they did.

4. Conclusion

From EDA, we see that higher annual income, higher average monthly spending on credit cards, higer education level, and certificate of deposit account holder have significantly higher tendency in purchasing the personal loan, and this is also reflected in our model that these atributes have pretty large absolute coefficients.

For the future campaign, Thera bank could focus on this group of customers and promote some by-side benefit that could convert more people from these groups to purchase the loan.